{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Composite Estimators using Pipeline & FeatureUnions\n",
    "\n",
    "<hr>\n",
    "\n",
    "### Agenda\n",
    "1. Introduction to Composite Estimators\n",
    "2. Pipelines\n",
    "3. TransformedTargetRegressor\n",
    "4. FeatureUnions\n",
    "5. ColumnTransformer\n",
    "6. GridSearch on pipeline\n",
    "\n",
    "PS: scikit version 0.20\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Introduction to Composite Estimators\n",
    "* One or more transformers are connected to estimators resulting into composite estimator.\n",
    "* Composite transformer is implemented using Pipeline\n",
    "* FeatureUnion is used to concatenate output of transformers to create derived feature\n",
    "* Pipeline make machine learning code reuseable & modular"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Pipeline\n",
    "* Before data is fed to learning algorithm, it needs to be handled for missing values.\n",
    "* Different pre-processing needs to be done.\n",
    "* The output of preprocessor is to be subjected to next preprocessor & finally the estimator\n",
    "* This whole process can be automated using Pipeline\n",
    "\n",
    "<img src=\"https://github.com/awantik/machine-learning-slides/blob/master/pipeline-ml2.png?raw=true\">\n",
    "\n",
    "* Intermediate steps .i.e transformers must implement fit & transform\n",
    "* The same trained pipeline can used for prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predicting horror author from text "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "horror_train_data = pd.read_csv('data/horror-train.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>text</th>\n",
       "      <th>author</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>id26305</td>\n",
       "      <td>This process, however, afforded me no means of...</td>\n",
       "      <td>EAP</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>id17569</td>\n",
       "      <td>It never once occurred to me that the fumbling...</td>\n",
       "      <td>HPL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>id11008</td>\n",
       "      <td>In his left hand was a gold snuff box, from wh...</td>\n",
       "      <td>EAP</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>id27763</td>\n",
       "      <td>How lovely is spring As we looked from Windsor...</td>\n",
       "      <td>MWS</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>id12958</td>\n",
       "      <td>Finding nothing else, not even gold, the Super...</td>\n",
       "      <td>HPL</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id                                               text author\n",
       "0  id26305  This process, however, afforded me no means of...    EAP\n",
       "1  id17569  It never once occurred to me that the fumbling...    HPL\n",
       "2  id11008  In his left hand was a gold snuff box, from wh...    EAP\n",
       "3  id27763  How lovely is spring As we looked from Windsor...    MWS\n",
       "4  id12958  Finding nothing else, not even gold, the Super...    HPL"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "horror_train_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "horror_test_data= pd.read_csv('data/horror-test.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 8392 entries, 0 to 8391\n",
      "Data columns (total 2 columns):\n",
      "id      8392 non-null object\n",
      "text    8392 non-null object\n",
      "dtypes: object(2)\n",
      "memory usage: 131.2+ KB\n"
     ]
    }
   ],
   "source": [
    "horror_test_data.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 19579 entries, 0 to 19578\n",
      "Data columns (total 3 columns):\n",
      "id        19579 non-null object\n",
      "text      19579 non-null object\n",
      "author    19579 non-null object\n",
      "dtypes: object(3)\n",
      "memory usage: 459.0+ KB\n"
     ]
    }
   ],
   "source": [
    "horror_train_data.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "horror_train_data = horror_train_data[['text','author']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import make_pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.svm import LinearSVC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipelines = []\n",
    "for model in [LogisticRegression(), DecisionTreeClassifier(), MultinomialNB(), LinearSVC()]:\n",
    "    pipeline = make_pipeline(\n",
    "              CountVectorizer(stop_words='english'),\n",
    "              TfidfTransformer(),\n",
    "              model)\n",
    "    pipelines.append(pipeline)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('decisiontreeclassifier',\n",
       " DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,\n",
       "             max_features=None, max_leaf_nodes=None,\n",
       "             min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "             min_samples_leaf=1, min_samples_split=2,\n",
       "             min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n",
       "             splitter='best'))"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipelines[1].steps[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "trainX,testX,trainY,testY = train_test_split(horror_train_data.text, horror_train_data.author)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/awantik/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n",
      "  FutureWarning)\n",
      "/home/awantik/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:460: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.\n",
      "  \"this warning.\", FutureWarning)\n"
     ]
    }
   ],
   "source": [
    "for pipeline in pipelines:\n",
    "    pipeline.fit(trainX, trainY)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.7883554647599591\n",
      "0.610214504596527\n",
      "0.8102145045965271\n",
      "0.8018386108273748\n"
     ]
    }
   ],
   "source": [
    "for pipeline in pipelines:\n",
    "    print (pipeline.score(testX, testY))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 8392 entries, 0 to 8391\n",
      "Data columns (total 2 columns):\n",
      "id      8392 non-null object\n",
      "text    8392 non-null object\n",
      "dtypes: object(2)\n",
      "memory usage: 131.2+ KB\n"
     ]
    }
   ],
   "source": [
    "horror_test_data.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = []\n",
    "for pipeline in pipelines:\n",
    "    result = pipeline.predict(horror_test_data.text)\n",
    "    results.append(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[array(['MWS', 'EAP', 'HPL', ..., 'EAP', 'MWS', 'EAP'], dtype=object),\n",
       " array(['MWS', 'EAP', 'HPL', ..., 'EAP', 'MWS', 'HPL'], dtype='<U3'),\n",
       " array(['MWS', 'EAP', 'HPL', ..., 'EAP', 'MWS', 'HPL'], dtype=object)]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<8392x22075 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 88741 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipelines[0].steps[0][1].transform(horror_test_data.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Caching transformers within a Pipeline\n",
    "* Storing state of transformers is also possible to prevent recomputation of transformers\n",
    "* When pipeline is subjected to GridSearch situations like this happens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import GridSearchCV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "svc_pipe =  make_pipeline(\n",
    "              CountVectorizer(stop_words='english'),\n",
    "              TfidfTransformer(),\n",
    "              LinearSVC())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "dt_pipe = make_pipeline(\n",
    "              CountVectorizer(stop_words='english'),\n",
    "              TfidfTransformer(),\n",
    "              DecisionTreeClassifier())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(memory=None,\n",
       "     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
       "        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
       "        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
       "        ngram_range=(1, 1), preprocessor=None, stop_words='english...ax_iter=1000,\n",
       "     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,\n",
       "     verbose=0))])"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "svc_pipe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('countvectorizer',\n",
       "  CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
       "          dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
       "          lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
       "          ngram_range=(1, 1), preprocessor=None, stop_words='english',\n",
       "          strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
       "          tokenizer=None, vocabulary=None)),\n",
       " ('tfidftransformer',\n",
       "  TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),\n",
       " ('linearsvc',\n",
       "  LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,\n",
       "       intercept_scaling=1, loss='squared_hinge', max_iter=1000,\n",
       "       multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,\n",
       "       verbose=0))]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "svc_pipe.steps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "params = {\n",
    "    'linearsvc__C': list(np.logspace(1,20,20))\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('decisiontreeclassifier',\n",
       " DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,\n",
       "             max_features=None, max_leaf_nodes=None,\n",
       "             min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "             min_samples_leaf=1, min_samples_split=2,\n",
       "             min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n",
       "             splitter='best'))"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dt_pipe.steps[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "params = {\n",
    "    'countvectorizer__max_features':[5000,7500,10000],\n",
    "    'decisiontreeclassifier__max_depth':[100,200]\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "gs = GridSearchCV(dt_pipe,cv=5,param_grid=params, n_jobs=-1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=5, error_score='raise-deprecating',\n",
       "       estimator=Pipeline(memory=None,\n",
       "     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
       "        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
       "        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
       "        ngram_range=(1, 1), preprocessor=None, stop_words='english...      min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n",
       "            splitter='best'))]),\n",
       "       fit_params=None, iid='warn', n_jobs=-1,\n",
       "       param_grid={'countvectorizer__max_features': [5000, 7500, 10000], 'decisiontreeclassifier__max_depth': [100, 200]},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n",
       "       scoring=None, verbose=0)"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gs.fit(trainX,trainY)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'countvectorizer__max_features': 5000,\n",
       " 'decisiontreeclassifier__max_depth': 200}"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gs.best_params_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6007218741487333"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gs.best_score_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "%timeit gs.fit(trainX,trainY)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tempfile import mkdtemp\n",
    "from shutil import rmtree\n",
    "from sklearn.utils import Memory\n",
    "\n",
    "cachedir = mkdtemp()\n",
    "memory = Memory(location=cachedir, verbose=0)\n",
    "svc_pipe_cached =  make_pipeline(\n",
    "              CountVectorizer(stop_words='english'),\n",
    "              TfidfTransformer(),\n",
    "              LinearSVC(), memory = memory)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gs_cached = GridSearchCV(svc_pipe_cached,cv=2,param_grid=params, verbose=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%timeit gs_cached.fit(trainX,trainY)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Transforming target in regression\n",
    "* Dependent variables & independent variables should be linearly related\n",
    "* In case, dependent variable is not normally distribted. We can make it happen for better error.\n",
    "* The prediction also needs to be remapped\n",
    "* This entire process can be automated using TransformedTargetRegressor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_boston\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "boston = load_boston()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = boston.data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = boston.target"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "regressor = LinearRegression()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n",
       "         normalize=False)"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regressor.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "R2 score: 0.64\n"
     ]
    }
   ],
   "source": [
    "print('R2 score: {0:.2f}'.format(regressor.score(X_test, y_test)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "pred = regressor.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_absolute_error, r2_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3.6683301481357113"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mean_absolute_error(y_pred=pred, y_true=y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert data from non-normal distribution to normal distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import PowerTransformer,QuantileTransformer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "pt = PowerTransformer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": [
    "qt = QuantileTransformer(output_distribution='normal')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "#X_tf = pt.fit_transform(X)\n",
    "#OR\n",
    "X_tf = qt.fit_transform(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X_tf, y, random_state=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [],
   "source": [
    "regressor = LinearRegression()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n",
       "         normalize=False)"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regressor.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "R2 score: 0.66\n"
     ]
    }
   ],
   "source": [
    "print('R2 score: {0:.2f}'.format(regressor.score(X_test, y_test)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [],
   "source": [
    "pred = regressor.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3.6331157010096167"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mean_absolute_error(y_pred=pred, y_true=y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.compose import TransformedTargetRegressor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [],
   "source": [
    "regr = TransformedTargetRegressor(regressor=regressor,transformer=qt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "TransformedTargetRegressor(check_inverse=True, func=None, inverse_func=None,\n",
       "              regressor=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n",
       "         normalize=False),\n",
       "              transformer=QuantileTransformer(copy=True, ignore_implicit_zeros=False, n_quantiles=1000,\n",
       "          output_distribution='normal', random_state=None,\n",
       "          subsample=100000))"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regr.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [],
   "source": [
    "pred = regr.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3.350934118150315"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mean_absolute_error(y_pred=pred, y_true=y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.7104191534298219"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r2_score(y_pred=pred, y_true=y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Hyper-parameters of TransformedTargetRegressor\n",
    "* regressor - initialized model\n",
    "* transformer - which supports transform & inverse_transform functions\n",
    "* function - to convert target \n",
    "* inverse_function - to convert back predicted target in original data scale"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. FeatureUnion\n",
    "* It combines several transformer objects into one transformer\n",
    "* Transformers are executed in parallel\n",
    "* During fitting, each of these are fit parallelly\n",
    "* During transform, output is concatenated parallely"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predicting employee exit - The Pipeline & FeatureUnion Way"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 143,
   "metadata": {},
   "outputs": [],
   "source": [
    "emp_data = pd.read_csv('https://raw.githubusercontent.com/zekelabs/data-science-complete-tutorial/master/Data/HR_comma_sep.csv.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 145,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>satisfaction_level</th>\n",
       "      <th>last_evaluation</th>\n",
       "      <th>number_project</th>\n",
       "      <th>average_montly_hours</th>\n",
       "      <th>time_spend_company</th>\n",
       "      <th>Work_accident</th>\n",
       "      <th>left</th>\n",
       "      <th>promotion_last_5years</th>\n",
       "      <th>sales</th>\n",
       "      <th>salary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.38</td>\n",
       "      <td>0.53</td>\n",
       "      <td>2</td>\n",
       "      <td>157</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>sales</td>\n",
       "      <td>low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.80</td>\n",
       "      <td>0.86</td>\n",
       "      <td>5</td>\n",
       "      <td>262</td>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>sales</td>\n",
       "      <td>medium</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.11</td>\n",
       "      <td>0.88</td>\n",
       "      <td>7</td>\n",
       "      <td>272</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>sales</td>\n",
       "      <td>medium</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.72</td>\n",
       "      <td>0.87</td>\n",
       "      <td>5</td>\n",
       "      <td>223</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>sales</td>\n",
       "      <td>low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.37</td>\n",
       "      <td>0.52</td>\n",
       "      <td>2</td>\n",
       "      <td>159</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>sales</td>\n",
       "      <td>low</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   satisfaction_level  last_evaluation  number_project  average_montly_hours  \\\n",
       "0                0.38             0.53               2                   157   \n",
       "1                0.80             0.86               5                   262   \n",
       "2                0.11             0.88               7                   272   \n",
       "3                0.72             0.87               5                   223   \n",
       "4                0.37             0.52               2                   159   \n",
       "\n",
       "   time_spend_company  Work_accident  left  promotion_last_5years  sales  \\\n",
       "0                   3              0     1                      0  sales   \n",
       "1                   6              0     1                      0  sales   \n",
       "2                   4              0     1                      0  sales   \n",
       "3                   5              0     1                      0  sales   \n",
       "4                   3              0     1                      0  sales   \n",
       "\n",
       "   salary  \n",
       "0     low  \n",
       "1  medium  \n",
       "2  medium  \n",
       "3     low  \n",
       "4     low  "
      ]
     },
     "execution_count": 145,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "emp_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 146,
   "metadata": {},
   "outputs": [],
   "source": [
    "emp_data.rename(columns={'sales':'dept'}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 151,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_cols = ['number_project','average_montly_hours','time_spend_company']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 163,
   "metadata": {},
   "outputs": [],
   "source": [
    "bin_cols = ['Work_accident','promotion_last_5years']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 225,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline, FeatureUnion\n",
    "from sklearn.base import BaseEstimator, TransformerMixin\n",
    "from sklearn.preprocessing import OneHotEncoder,LabelEncoder, LabelBinarizer, MinMaxScaler\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.feature_selection import SelectKBest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 182,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.naive_bayes import GaussianNB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {},
   "outputs": [],
   "source": [
    "class ItemSelector(BaseEstimator, TransformerMixin):\n",
    "    def __init__(self,key):\n",
    "        self.key = key\n",
    "        \n",
    "    def fit(self,X,Y=None):\n",
    "        return self\n",
    "    \n",
    "    def transform(self,X,Y=None):\n",
    "        return X[self.key]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MyLabelBinarizer(TransformerMixin):\n",
    "    def __init__(self, *args, **kwargs):\n",
    "        self.encoder = LabelBinarizer(*args, **kwargs)\n",
    "    def fit(self, x, y=0):\n",
    "        self.encoder.fit(x)\n",
    "        return self\n",
    "    def transform(self, x, y=0):\n",
    "        return self.encoder.transform(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline_dept = Pipeline([\n",
    "    ('selector', ItemSelector('dept')),\n",
    "    ('lb', MyLabelBinarizer()),\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 157,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0, 0, 0, ..., 1, 0, 0],\n",
       "       [0, 0, 0, ..., 1, 0, 0],\n",
       "       [0, 0, 0, ..., 1, 0, 0],\n",
       "       ...,\n",
       "       [0, 0, 0, ..., 0, 1, 0],\n",
       "       [0, 0, 0, ..., 0, 1, 0],\n",
       "       [0, 0, 0, ..., 0, 1, 0]])"
      ]
     },
     "execution_count": 157,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipeline_dept.fit_transform(emp_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 158,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MultiItemSelector(BaseEstimator, TransformerMixin):\n",
    "    def __init__(self,keys):\n",
    "        self.keys = keys\n",
    "        \n",
    "    def fit(self,X,Y=None):\n",
    "        return self\n",
    "    \n",
    "    def transform(self,X,Y=None):\n",
    "        return X[self.keys]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 159,
   "metadata": {},
   "outputs": [],
   "source": [
    "class SalaryMapper(BaseEstimator, TransformerMixin):\n",
    "    \n",
    "    def fit(self,X,Y=None):\n",
    "        return self\n",
    "    \n",
    "    def transform(self,X,Y=None):\n",
    "        db = {'low':1,'medium':2,'high':3}\n",
    "        print (type(X))\n",
    "        r = X.str.strip().replace(db)\n",
    "        return r.values.reshape(-1,1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 160,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline_salary = Pipeline([\n",
    "    ('selector',ItemSelector('salary')),\n",
    "    ('sm',SalaryMapper())\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 162,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline_numbers = Pipeline([\n",
    "    ('selector',MultiItemSelector(num_cols)),\n",
    "    ('scaling', MinMaxScaler())\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 165,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline_bin = Pipeline([\n",
    "    ('selector',MultiItemSelector(bin_cols))\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 166,
   "metadata": {},
   "outputs": [],
   "source": [
    "fu = FeatureUnion([\n",
    "    ('dept_pipe',pipeline_dept),\n",
    "    ('salary_pipe',pipeline_salary),\n",
    "    ('numbers_pipe',pipeline_numbers),\n",
    "    ('bin_pipe',pipeline_bin)\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 200,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = Pipeline([\n",
    "    ('union',fu),\n",
    "    #('feature_selector',SelectKBest(k=15)),\n",
    "    ('classifier',RandomForestClassifier(n_estimators=10))\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 201,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 202,
   "metadata": {},
   "outputs": [],
   "source": [
    "trainX,testX, trainY,testY = train_test_split(emp_data.drop('left',axis=1), emp_data.left)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 203,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.series.Series'>\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\awant\\Anaconda3\\lib\\site-packages\\sklearn\\preprocessing\\data.py:323: DataConversionWarning: Data with input dtype int64 were all converted to float64 by MinMaxScaler.\n",
      "  return self.partial_fit(X, y)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Pipeline(memory=None,\n",
       "     steps=[('union', FeatureUnion(n_jobs=None,\n",
       "       transformer_list=[('dept_pipe', Pipeline(memory=None,\n",
       "     steps=[('selector', ItemSelector(key='dept')), ('lb', <__main__.MyLabelBinarizer object at 0x00000200323DC748>)])), ('salary_pipe', Pipeline(memory=None,\n",
       "     steps=[('selector', ItemSelector...obs=None,\n",
       "            oob_score=False, random_state=None, verbose=0,\n",
       "            warm_start=False))])"
      ]
     },
     "execution_count": 203,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipeline.fit(trainX,trainY)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 204,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.series.Series'>\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([0, 0, 1, ..., 0, 0, 0], dtype=int64)"
      ]
     },
     "execution_count": 204,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipeline.predict(testX)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 205,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.series.Series'>\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.9664"
      ]
     },
     "execution_count": 205,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipeline.score(testX,testY)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. ColumnTransformer ( Beta stage )\n",
    "* Datasets consist of hetrogenous types of columns\n",
    "* An easy technique to map column to pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 254,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import StandardScaler, OneHotEncoder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 251,
   "metadata": {},
   "outputs": [],
   "source": [
    "titanic_data = pd.read_csv('https://raw.githubusercontent.com/zekelabs/data-science-complete-tutorial/master/Data/titanic-train.csv.txt', index_col='PassengerId')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 252,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>PassengerId</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Braund, Mr. Owen Harris</td>\n",
       "      <td>male</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>A/5 21171</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
       "      <td>female</td>\n",
       "      <td>38.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>PC 17599</td>\n",
       "      <td>71.2833</td>\n",
       "      <td>C85</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Heikkinen, Miss. Laina</td>\n",
       "      <td>female</td>\n",
       "      <td>26.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>STON/O2. 3101282</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
       "      <td>female</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>113803</td>\n",
       "      <td>53.1000</td>\n",
       "      <td>C123</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Allen, Mr. William Henry</td>\n",
       "      <td>male</td>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>373450</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             Survived  Pclass  \\\n",
       "PassengerId                     \n",
       "1                   0       3   \n",
       "2                   1       1   \n",
       "3                   1       3   \n",
       "4                   1       1   \n",
       "5                   0       3   \n",
       "\n",
       "                                                          Name     Sex   Age  \\\n",
       "PassengerId                                                                    \n",
       "1                                      Braund, Mr. Owen Harris    male  22.0   \n",
       "2            Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0   \n",
       "3                                       Heikkinen, Miss. Laina  female  26.0   \n",
       "4                 Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0   \n",
       "5                                     Allen, Mr. William Henry    male  35.0   \n",
       "\n",
       "             SibSp  Parch            Ticket     Fare Cabin Embarked  \n",
       "PassengerId                                                          \n",
       "1                1      0         A/5 21171   7.2500   NaN        S  \n",
       "2                1      0          PC 17599  71.2833   C85        C  \n",
       "3                0      0  STON/O2. 3101282   7.9250   NaN        S  \n",
       "4                1      0            113803  53.1000  C123        S  \n",
       "5                0      0            373450   8.0500   NaN        S  "
      ]
     },
     "execution_count": 252,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "titanic_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 256,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_cols = ['Age','Fare']\n",
    "cat_cols = ['Embarked','Sex','Pclass']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 318,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline_num = Pipeline(steps=[\n",
    "    ('imputer', SimpleImputer(strategy='median')),\n",
    "    ('scaling',StandardScaler())\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 319,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline_cat = Pipeline(steps=[\n",
    "    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),\n",
    "    ('encoding', OneHotEncoder(handle_unknown='ignore'))\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 320,
   "metadata": {},
   "outputs": [],
   "source": [
    "preprocessor = ColumnTransformer(\n",
    "    transformers=[\n",
    "        ('num', pipeline_num, num_cols),\n",
    "        ('cat', pipeline_cat, cat_cols)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 321,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = Pipeline(steps=[('preprocessor',preprocessor),\n",
    "                ('classifier',RandomForestClassifier(n_estimators=10))])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 322,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = titanic_data.drop('Survived',axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 323,
   "metadata": {},
   "outputs": [],
   "source": [
    "Y = titanic_data.Survived"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 324,
   "metadata": {},
   "outputs": [],
   "source": [
    "trainX,testX,trainY,testY = train_test_split(X,Y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 325,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(memory=None,\n",
       "     steps=[('preprocessor', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,\n",
       "         transformer_weights=None,\n",
       "         transformers=[('num', Pipeline(memory=None,\n",
       "     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,\n",
       "       strategy='median', verbo...obs=None,\n",
       "            oob_score=False, random_state=None, verbose=0,\n",
       "            warm_start=False))])"
      ]
     },
     "execution_count": 325,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipeline.fit(trainX,trainY)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 326,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.852017937219731"
      ]
     },
     "execution_count": 326,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipeline.score(testX,testY)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. GridSearch for pipelines\n",
    "* Pipelines consist of combination of transformers & estimators\n",
    "* Both transformers & estimators are configured hyper-parameters as a fine tuning process"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 311,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('preprocessor',\n",
       "  ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,\n",
       "           transformer_weights=None,\n",
       "           transformers=[('num', Pipeline(memory=None,\n",
       "       steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',\n",
       "         verbose=0)), ('scaling', StandardScaler(copy=True, with_mean=True, with_std=True))]), ['Age', 'Fare']), ('cat', Pipeline(memory=None,\n",
       "       steps=[...4'>, handle_unknown='ignore',\n",
       "         n_values=None, sparse=True))]), ['Embarked', 'Sex', 'Pclass'])])),\n",
       " ('classifier',\n",
       "  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
       "              max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
       "              min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "              min_samples_leaf=1, min_samples_split=2,\n",
       "              min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,\n",
       "              oob_score=False, random_state=None, verbose=0,\n",
       "              warm_start=False))]"
      ]
     },
     "execution_count": 311,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipeline.steps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 327,
   "metadata": {},
   "outputs": [],
   "source": [
    "param_grid = {\n",
    "    'preprocessor__num__imputer__strategy': ['mean', 'median'],\n",
    "    'classifier__n_estimators': [10,15,20],\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 328,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import GridSearchCV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 329,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=5, error_score='raise-deprecating',\n",
       "       estimator=Pipeline(memory=None,\n",
       "     steps=[('preprocessor', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,\n",
       "         transformer_weights=None,\n",
       "         transformers=[('num', Pipeline(memory=None,\n",
       "     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,\n",
       "       strategy='median', verbo...obs=None,\n",
       "            oob_score=False, random_state=None, verbose=0,\n",
       "            warm_start=False))]),\n",
       "       fit_params=None, iid=False, n_jobs=None,\n",
       "       param_grid={'preprocessor__num__imputer__strategy': ['mean', 'median'], 'classifier__n_estimators': [10, 15, 20]},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n",
       "       scoring=None, verbose=0)"
      ]
     },
     "execution_count": 329,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grid_search = GridSearchCV(pipeline, param_grid, cv=5, iid=False)\n",
    "grid_search.fit(trainX,trainY)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 330,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8295964125560538"
      ]
     },
     "execution_count": 330,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grid_search.score(testX,testY)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 331,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'classifier__n_estimators': 20,\n",
       " 'preprocessor__num__imputer__strategy': 'mean'}"
      ]
     },
     "execution_count": 331,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grid_search.best_params_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
